DATABASE ANNOTATION AND RETRIEVAL



The present invention relates to the annotation of data
files which are to be stored in a database for
facilitating their subsequent retrieval. The present
invention is also concerned with a system for generating
the annotation data which is added to the data file and
with a system for searching the annotation data in the
database to retrieve a desired data file in response to
a user's input query.

Databases of information are well known and suffer from
the problem of how to locate and retrieve the desired
information from the database quickly and efficiently.
Existing database search tools allow the user to search
the database using typed keywords. Whilst this is quick
and efficient, this type of searching is not suitable for
various kinds of databases, such as video or audio
databases.

According to one aspect, the present invention aims to
provide a data structure for the annotation of data files
within a database, which allows a quick and efficient
search to be carried out in response to a user's input
query.



Exemplary embodiments of the present invention will now
be described with reference to Figures 1 to 10, in which:

Figure 1 is a schematic view of a computer which is
programmed to operate an embodiment of the present
invention;

Figure 2 is a block diagram showing a phoneme and word 
annotator unit which is operable to generate phoneme and 
word annotation data for appendage to a data file; 


Figure 3 is a block diagram illustrating one way in which 
the phoneme and word annotator can generate the 
annotation data from an input video data file; 

Figure 4a is a schematic diagram of a phoneme lattice for
an example audio string from the input video data file;

Figure 4b is a schematic diagram of a word and phoneme
lattice embodying one aspect of the present invention,
for an example audio string from the input video data
file;

Figure 5 is a schematic block diagram of a user's
terminal which allows the user to retrieve information
from the database by a voice query;

Figure 6a is a flow diagram illustrating part of the flow
control of the user terminal shown in Figure 5;

Figure 6b is a flow diagram illustrating the remaining
part of the flow control of the user terminal shown in
Figure 5;

Figure 7 is a flow diagram illustrating the way in which 
a search engine forming part of the user's terminal 
carries out a phoneme search within the database; 

Figure 8 is a schematic diagram illustrating the form of 
a phoneme string and four M-GRAMS generated from the 
phoneme string; 

Figure 9 is a plot showing two vectors and the angle 
between the two vectors; and 

Figure 10 is a schematic diagram of a pair of word and
phoneme lattices, for example audio strings from two
speakers;

Figure 11 is a schematic block diagram illustrating a 
user terminal which is operable to access a database 
located on a remote server via a data network in response 
to an input utterance by the user; 




Figure 12 is a schematic block diagram of a user terminal
which allows a user to access a database located in a
remote server in response to an input utterance from the
user;

Figure 13 is a schematic block diagram of a user terminal
which allows a user to access a database by a typed input
query; and

Figure 14 is a schematic block diagram illustrating the 
way in which a phoneme and word lattice can be generated 
from script data contained within a video data file. 

Embodiments of the present invention can be implemented 
using dedicated hardware circuits, but the embodiment to 
be described is implemented in computer software or code, 
which is run in conjunction with processing hardware such 
as a personal computer, work station, photocopier, 
facsimile machine, personal digital assistant (PDA) or 
the like. 

Figure 1 shows a personal computer (PC) 1 which is 
programmed to operate an embodiment of the present 
invention. A keyboard 3, a pointing device 5, a 
microphone 7 and a telephone line 9 are connected to the 
PC 1 via an interface 11. The keyboard 3 and pointing 
device 5 enable the system to be controlled by a user. 
The microphone 7 converts acoustic speech signals from
the user into equivalent electrical signals and supplies
them to the PC 1 for processing. An internal modem and
speech receiving circuit (not shown) are connected to the
telephone line 9 so that the PC 1 can communicate with,
for example, a remote computer or with a remote user.

The programme instructions which make the PC 1 operate
in accordance with the present invention may be supplied
for use with an existing PC 1 on, for example, a storage
device such as a magnetic disc 13, or by downloading the
software from the Internet (not shown) via the internal
modem and telephone line 9.

DATA FILE ANNOTATION 

Figure 2 is a block diagram illustrating the way in which
annotation data 21 for an input data file 23 is generated
by a phoneme and word annotating unit 25. As shown, the
generated phoneme and word annotation data 21 is then
combined with the data file 23 in the data combination
unit 27 and the combined data file output thereby is
input to the database 29. In this embodiment, the
annotation data 21 comprises a combined phoneme (or
phoneme like) and word lattice which allows the user to
retrieve information from the database by a voice query.

As those skilled in the art will appreciate, the data
file 23 can be any kind of data file, such as a video
file, an audio file, a multimedia file etc.






A system has been proposed to generate N-Best word lists
for an audio stream as annotation data by passing the
audio data from a video data file through an automatic
speech recognition unit. However, such word-based
systems suffer from a number of problems. These include
(i) that state of the art speech recognition systems
still make basic mistakes in recognition, (ii) that state
of the art automatic speech recognition systems use a
dictionary of perhaps 20,000 to 100,000 words and cannot
produce words outside that vocabulary, and (iii) that the
production of N-Best lists grows exponentially with the
number of hypotheses at each stage, therefore resulting
in the annotation data becoming prohibitively large for
long utterances.

The first of these problems may not be that significant
if the same automatic speech recognition system is used
to generate the annotation data and to subsequently
retrieve the corresponding data file, since the same
decoding error could occur. However, with advances in
automatic speech recognition systems being made each
year, it is likely that in the future the same type of
error may not occur, resulting in the inability to
retrieve the corresponding data file at that later date.
With regard to the second problem, this is particularly
significant in video data applications, since users are
likely to use names and places (which may not be in the
speech recognition dictionary) as input query terms. In
place of these names, the automatic speech recognition
system will typically replace the out of vocabulary words
with a phonetically similar word or words within the
vocabulary, often corrupting nearby decodings. This can
also result in the failure to retrieve the required data
file upon subsequent request.

In contrast, with the proposed phoneme and word lattice
annotation data, a quick and efficient search using the
word data in the database 29 can be carried out and, if
this fails to provide the required data file, then a
further search using the more robust phoneme data can be
performed. The phoneme and word lattice is an acyclic
directed graph with a single entry point and a single
exit point. It represents different parses of the audio
stream within the data file. It is not simply a sequence
of words with alternatives since each word does not have
to be replaced by a single alternative, one word can be
substituted for two or more words or phonemes, and the
whole structure can form a substitution for one or more
words or phonemes. Therefore, the density of data within
the phoneme and word lattice essentially remains linear
throughout the audio data, rather than growing
exponentially as in the case of the N-Best technique
discussed above. As those skilled in the art of speech
recognition will realise, the use of phoneme data is more
robust, because phonemes are dictionary independent and
allow the system to cope with out of vocabulary words,
such as names, places, foreign words etc. The use of
phoneme data is also capable of making the system future
proof, since it allows data files which are placed into
the database to be retrieved even when the words were not
understood by the original automatic speech recognition
system.
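
The difference in growth between the two representations can be illustrated numerically. The following is a minimal sketch which assumes, purely for illustration, that every position in an utterance has the same number of alternative hypotheses and that a list spells out every combination of alternatives; the function names are hypothetical and do not appear in the embodiment.

```python
def nbest_sequences(alternatives_per_position: int, positions: int) -> int:
    """Number of distinct sequences a list of every combination would hold."""
    return alternatives_per_position ** positions


def lattice_links(alternatives_per_position: int, positions: int) -> int:
    """Number of links when the alternatives share nodes in one lattice."""
    return alternatives_per_position * positions


# Compare the two counts for utterances of increasing length.
for n in (5, 10, 20):
    print(n, nbest_sequences(2, n), lattice_links(2, n))
```

Under these assumptions the list size doubles with every added position, whereas the lattice gains a fixed number of links per position.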

The way in which this phoneme and word lattice annotation
data can be generated for a video data file will now be
described with reference to Figure 3. As shown, the
video data file 31 comprises video data 31-1, which
defines the sequence of images forming the video sequence
and audio data 31-2, which defines the audio which is
associated with the video sequence. As is well known,
the audio data 31-2 is time synchronised with the video
data 31-1 so that, in use, both the video and audio data
are supplied to the user at the same time.

As shown in Figure 3, in this embodiment, the audio data
31-2 is input to an automatic speech recognition unit 33,
which is operable to generate a phoneme lattice
corresponding to the stream of audio data 31-2. Such an
automatic speech recognition unit 33 is commonly
available in the art and will not be described in further
detail. The reader is referred to, for example, the book
entitled 'Fundamentals of Speech Recognition' by Lawrence
Rabiner and Biing-Hwang Juang and, in particular, to
pages 42 to 50 thereof, for further information on this
type of speech recognition system.

Figure 4a illustrates the form of the phoneme lattice
data output by the speech recognition unit 33, for the
input audio corresponding to the phrase '...tell me about
Jason...'. As shown, the automatic speech recognition
unit 33 identifies a number of different possible phoneme
strings which correspond to this input audio utterance.
For example, the speech recognition system considers that
the first phoneme in the audio string is either a /t/ or
a /d/. As is well known in the art of speech
recognition, these different possibilities can have their
own weighting which is generated by the speech
recognition unit 33 and is indicative of the confidence
of the speech recognition unit's output. For example,
the phoneme /t/ may be given a weighting of 0.9 and the
phoneme /d/ may be given a weighting of 0.1, indicating
that the speech recognition system is fairly confident
that the corresponding portion of audio represents the
phoneme /t/, but that it still may be the phoneme /d/.
In this embodiment, however, this weighting of the
phonemes is not performed.



As shown in Figure 3, the phoneme lattice data 35 output
by the automatic speech recognition unit 33 is input to
a word decoder 37 which is operable to identify possible
words within the phoneme lattice data 35. In this
embodiment, the words identified by the word decoder 37
are incorporated into the phoneme lattice data structure.
For example, for the phoneme lattice shown in Figure 4a,
the word decoder 37 identifies the words 'tell', 'dell',
'term', 'me', 'a', 'boat', 'about', 'chase' and 'sun'.
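
One way such a word decoder might operate can be sketched as follows. This is an illustrative assumption only: the embodiment does not specify how the word decoder 37 works, and the lattice layout, the phoneme symbols and the function name `find_words` are all hypothetical. The sketch traces each dictionary pronunciation along the phoneme links and reports the node pairs it spans.

```python
from typing import Dict, List, Tuple

# node -> list of (phoneme label, target node); a hypothetical toy layout
Lattice = Dict[int, List[Tuple[str, int]]]


def find_words(lattice: Lattice,
               pronunciations: Dict[str, Tuple[str, ...]]
               ) -> List[Tuple[int, int, str]]:
    """Return (start node, end node, word) for every dictionary word whose
    phoneme sequence can be traced along links of the lattice."""
    found = set()
    for word, phones in pronunciations.items():
        for start in lattice:
            # frontier holds the nodes reachable from `start` after
            # matching the phonemes consumed so far
            frontier = {start}
            for phone in phones:
                frontier = {
                    target
                    for node in frontier
                    for label, target in lattice.get(node, [])
                    if label == phone
                }
                if not frontier:
                    break
            for end in frontier:
                found.add((start, end, word))
    return sorted(found)


# Toy lattice for "tell me" with a /t/ or /d/ alternative at the start.
toy_lattice: Lattice = {
    0: [("t", 1), ("d", 1)],
    1: [("eh", 2)],
    2: [("l", 3)],
    3: [("m", 4)],
    4: [("iy", 5)],
}
toy_dictionary = {
    "tell": ("t", "eh", "l"),
    "dell": ("d", "eh", "l"),
    "me": ("m", "iy"),
    "sun": ("s", "ah", "n"),
}
print(find_words(toy_lattice, toy_dictionary))
```

Each reported triple corresponds to a word link which could be added between the two nodes of the phoneme lattice.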

As shown in Figure 4b, these identified words are added
to the phoneme lattice data structure output by the
speech recognition unit 33, to generate a phoneme and
word lattice data structure which forms the annotation
data 31-3. This annotation data 31-3 is then combined
with the video data file 31 to generate an augmented
video data file 31' which is then stored in the database
29. As those skilled in the art will appreciate, in a
similar way to the way in which the audio data 31-2 is
time synchronised with the video data 31-1, the
annotation data 31-3 is also time synchronised and
associated with the corresponding video data 31-1 and
audio data 31-2, so that a desired portion of the video
and audio data can be retrieved by searching for and
locating the corresponding portion of the annotation data
31-3.



In this embodiment, the annotation data 31-3 stored in
the database 29 has the following general form:

HEADER
    - time of start
    - flag if word, if phoneme, if mixed
    - time index associating the location of blocks of
      annotation data within memory to a given time point
    - word set used (i.e. the dictionary)
    - phoneme set used
    - the language to which the vocabulary pertains

Block(i) i = 0,1,2,...
    node Nj j = 0,1,2,...
        - time offset of node from start of block
        - phoneme links (k) k = 0,1,2,...
            offset to node Nj = Nk - Nj (Nk is the node to
            which link k extends)
            phoneme associated with link (k)
        - word links (l) l = 0,1,2,...
            offset to node Nj = Ni - Nj (Ni is the node to
            which link l extends)
            word associated with link (l)
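
The general form above could be mirrored in code along the following lines. This is only a sketch: the class and field names are hypothetical, and the choice of types (for example, a string for the time of start and a dictionary for the time index) is an assumption made for illustration rather than something specified by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Link:
    offset: int   # index of the target node minus index of the current node
    label: str    # the phoneme or word associated with the link


@dataclass
class Node:
    time_offset: float                       # from the start of the block
    phoneme_links: List[Link] = field(default_factory=list)
    word_links: List[Link] = field(default_factory=list)


@dataclass
class Block:
    nodes: List[Node] = field(default_factory=list)


@dataclass
class Header:
    time_of_start: str
    kind: str                     # "word", "phoneme" or "mixed"
    time_index: Dict[float, int]  # time offset -> location of a block
    word_set: str                 # identifies the dictionary used
    phoneme_set: str
    language: str


@dataclass
class Annotation:
    header: Header
    blocks: List[Block] = field(default_factory=list)
```

Keeping phoneme links and word links in separate lists on each node lets a search follow only the kind of link it needs.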

The time of start data in the header can identify the
time and date of transmission of the data. For example,
if the video file is a news broadcast, then the time of
start may include the exact time of the broadcast and the
date on which it was broadcast.

The flag identifying if the annotation data is word
annotation data, phoneme annotation data or if it is
mixed is provided since not all the data files within the
database will include the combined phoneme and word
lattice annotation data discussed above, and in this
case, a different search strategy would be used to search
this annotation data.

In this embodiment, the annotation data is divided into
blocks in order to allow the search to jump into the
middle of the annotation data for a given audio data
stream. The header therefore includes a time index which
associates the location of the blocks of annotation data
within the memory to a given time offset between the time
of start and the time corresponding to the beginning of
the block.
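
A lookup through such a time index can be sketched as a binary search over the block start offsets. The representation below (a sorted list of pairs, offsets in seconds, integer locations) and the name `locate_block` are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple


def locate_block(time_index: List[Tuple[float, int]], t: float) -> int:
    """Return the stored location of the block that covers time offset t.

    time_index is a list of (block start offset, location) pairs sorted by
    start offset; the units are assumptions for this sketch.
    """
    starts = [start for start, _ in time_index]
    pos = bisect_right(starts, t) - 1
    if pos < 0:
        raise ValueError("time offset precedes the first block")
    return time_index[pos][1]


# Hypothetical index: three blocks starting at 0 s, 30 s and 60 s.
index = [(0.0, 0), (30.0, 4096), (60.0, 9120)]
print(locate_block(index, 42.5))
```

The search can then begin at the returned location instead of scanning the annotation data from its start.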

The header also includes data defining the word set used
(i.e. the dictionary), the phoneme set used and the
language to which the vocabulary pertains. The header
may also include details of the automatic speech
recognition system used to generate the annotation data
and any appropriate settings thereof which were used
during the generation of the annotation data.

The blocks of annotation data then follow the header and
identify, for each node in the block, the time offset of
the node from the start of the block, the phoneme links
which connect that node to other nodes by phonemes and
word links which connect that node to other nodes by
words. Each phoneme link and word link identifies the
phoneme or word which is associated with the link. They
also identify the offset to the current node. For
example, if node N50 is linked to node N55 by a phoneme
link, then the offset to node N50 is 5. As those skilled
in the art will appreciate, using an offset indication
like this allows the division of the continuous
annotation data into separate blocks.
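
The offset arithmetic in the example can be written out as below. The helper names are hypothetical; the point of the sketch is that a relative offset is unchanged when the nodes of a block are renumbered, which is what makes the division into blocks straightforward.

```python
def link_offset(current_node: int, target_node: int) -> int:
    """Relative offset stored on a link leaving current_node."""
    return target_node - current_node


def link_target(current_node: int, offset: int) -> int:
    """Resolve a stored offset back to the index of the target node."""
    return current_node + offset


# Node N50 linked to node N55, as in the example in the text.
offset = link_offset(50, 55)
print(offset, link_target(50, offset))

# Renumbering both nodes relative to a block that starts at node 48
# leaves the stored offset unchanged.
print(link_offset(50 - 48, 55 - 48))
```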

In an embodiment where an automatic speech recognition
unit outputs weightings indicative of the confidence of
the speech recognition unit's output, these weightings or
confidence scores would also be included within the data
structure. In particular, a confidence score would be
provided for each node which is indicative of the
confidence of arriving at the node and each of the
phoneme and word links would include a transition score
depending upon the weighting given to the corresponding
phoneme or word. These weightings would then be used to
control the search and retrieval of the data files by
discarding those matches which have a low confidence
score.
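
A sketch of how matches might be discarded on the basis of such scores is given below. The text does not say how the transition scores along a matched path are combined; multiplying them is an assumption made here for illustration, and the function names and threshold value are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def path_confidence(transition_scores: Sequence[float]) -> float:
    """Combine the transition scores along a matched path by multiplication
    (an assumption; other combinations are possible)."""
    return math.prod(transition_scores)


def keep_confident(matches: List[Tuple[str, Sequence[float]]],
                   threshold: float) -> List[str]:
    """Return the names of matches whose combined score meets the threshold."""
    return [name for name, scores in matches
            if path_confidence(scores) >= threshold]


candidates = [("file A", [0.9, 0.8]), ("file B", [0.9, 0.1])]
print(keep_confident(candidates, 0.5))
```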

DATA FILE RETRIEVAL

Figure 5 is a block diagram illustrating the form of a
user terminal 59 which can be used to retrieve the
annotated data files from the database 29. This user
terminal 59 may be, for example, a personal computer,
hand held device or the like. As shown, in this
embodiment, the user terminal 59 comprises the database
29 of annotated data files, an automatic speech
recognition unit 51, a search engine 53, a control unit
55 and a display 57. In operation, the automatic speech
recognition unit 51 is operable to process an input voice
query from the user 39 received via the microphone 7 and
the input line 61 and to generate therefrom corresponding
phoneme and word data. This data may take the form
of a phoneme and word lattice, but this is not essential.
This phoneme and word data is then input to the control
unit 55 which is operable to initiate an appropriate
search of the database 29 using the search engine 53.
The results of the search, generated by the search engine
53, are then transmitted back to the control unit 55
which analyses the search results and generates and
displays appropriate display data to the user via the
display 57.

Figures 6a and 6b are flow diagrams which illustrate the
way in which the user terminal 59 operates in this
embodiment. In step s1, the user terminal 59 is in an
idle state and awaits an input query from the user 39.
Upon receipt of an input query, the phoneme and word data
for the input query is generated in step s3 by the
automatic speech recognition unit 51. The control unit
55 then instructs the search engine 53, in step s5, to
perform a search in the database 29 using the word data
generated for the input query. The word search employed
in this embodiment is the same as is currently being used
in the art for typed keyword searches, and will not be
described in more detail here. If in step s7, the
control unit 55 identifies from the search results, that
a match for the user's input query has been found, then
it outputs the search results to the user via the display
57.

In this embodiment, the user terminal 59 then allows the
user to consider the search results and awaits the user's
confirmation as to whether or not the results correspond
to the information the user requires. If they do, then
the processing proceeds from step s11 to the end of the
processing and the user terminal 59 returns to its idle
state and awaits the next input query. If, however, the
user indicates (by, for example, inputting an appropriate
voice command) that the search results do not correspond
to the desired information, then the processing proceeds
from step s11 to step s13, where the search engine 53
performs a phoneme search of the database 29. However,
in this embodiment, the phoneme search performed in step
s13 is not of the whole database 29, since this could
take several hours depending on the size of the database.
Instead, the phoneme search performed in step s13 uses
the results of the word search performed in step s5 to
identify one or more portions within the database which
may correspond to the user's input query. The way in
which the phoneme search performed in step s13 is
performed in this embodiment will be described in more
detail later. After the phoneme search has been
performed, the control unit 55 identifies, in step s15,
whether or not a match has been found. If a match has
been found, then the processing proceeds to step s17
where the control unit 55 causes the search results to be
displayed to the user on the display 57. The system then
awaits the user's confirmation as to whether or not the
search results correspond to the desired information.
If the results are correct, then the processing passes
from step s19 to the end and the user terminal 59 returns
to its idle state and awaits the next input query. If,
however, the user indicates that the search results do
not correspond to the desired information, then the
processing proceeds from step s19 to step s21, where the
control unit 55 is operable to ask the user whether or
not a phoneme search should be performed of the whole
database 29. If in response to this query, the user
indicates that such a search should be performed, then
the processing proceeds to step s23 where the search
engine performs a phoneme search of the entire database
29.

On completion of this search, the control unit 55
identifies, in step s25, whether or not a match for the
user's input query has been found. If a match is found,
then the processing proceeds to step s27 where the
control unit 55 causes the search results to be displayed
to the user on the display 57. If the search results are
correct, then the processing proceeds from step s29 to
the end of the processing and the user terminal 59
returns to its idle state and awaits the next input
query. If, on the other hand, the user indicates that the
search results still do not correspond to the desired
information, then the processing passes to step s31 where
the control unit 55 queries the user, via the display 57,
whether or not the user wishes to redefine or amend the
search query. If the user does wish to redefine or amend
the search query, then the processing returns to step s3
where the user's subsequent input query is processed in
a similar manner. If the search is not to be redefined
or amended, then the search results and the user's
initial input query are discarded and the user terminal
59 returns to its idle state and awaits the next input
query.
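
The escalation from the word search to the two phoneme searches can be summarised in code. The sketch below is an assumption-laden outline rather than the flow of Figures 6a and 6b themselves: the searches and user prompts are passed in as hypothetical callables, and it is assumed that an empty result at any stage simply falls through to the next stage.

```python
from typing import Callable, List, Optional

Results = List[str]


def retrieve(
    word_search: Callable[[], Results],
    phoneme_search_portion: Callable[[], Results],
    phoneme_search_all: Callable[[], Results],
    user_accepts: Callable[[Results], bool],
    user_wants_full_search: Callable[[], bool],
) -> Optional[Results]:
    """Escalate from the word search to the phoneme searches until the
    user accepts a result; return None if nothing is accepted."""
    results = word_search()                      # step s5
    if results and user_accepts(results):        # steps s7 to s11
        return results
    results = phoneme_search_portion()           # step s13
    if results and user_accepts(results):        # steps s15 to s19
        return results
    if user_wants_full_search():                 # step s21
        results = phoneme_search_all()           # step s23
        if results and user_accepts(results):    # steps s25 to s29
            return results
    return None                                  # caller may re-query (s31)


# Usage with stand-in searches: only the full phoneme search succeeds.
outcome = retrieve(
    word_search=lambda: ["wrong file"],
    phoneme_search_portion=lambda: [],
    phoneme_search_all=lambda: ["right file"],
    user_accepts=lambda r: r == ["right file"],
    user_wants_full_search=lambda: True,
)
print(outcome)
```

Ordering the stages from cheapest to most expensive means the slow search of the whole database is only run when the user explicitly asks for it.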

PHONEME SEARCH

As mentioned above, in steps s13 and s23, the search
engine 53 compares the phoneme data of the input query
with the phoneme data in the phoneme and word lattice
annotation data stored in the database 29. Various
techniques can be used including standard pattern
matching techniques such as dynamic programming, to carry
out this comparison. In this embodiment, a technique
which we refer to as M-GRAMS is used. This technique was
proposed by Ng, K. and Zue, V.W. and is discussed in, for
example, the paper entitled "Subword unit representations
for spoken document retrieval" published in the
proceedings of Eurospeech 1997.

The problem with searching for individual phonemes is
that there will be many occurrences of each phoneme
within the database. Therefore, an individual phoneme
on its own does not provide enough discriminability to
be able to match the phoneme string of the input query
with the phoneme strings within the database. Syllable
sized units, however, are likely to provide more
discriminability, although they are not easy to identify.
The M-GRAM technique presents a suitable compromise
between these two possibilities and takes overlapping
fixed size fragments, or M-GRAMS, of the phoneme string
to provide a set of features. This is illustrated in
Figure 8, which shows part of an input phoneme string
having phonemes a, b, c, d, e and f, which are split
into four M-GRAMS (a, b, c), (b, c, d), (c, d, e) and (d,
e, f). In this illustration, each of the four M-GRAMS
comprises a sequence of three phonemes which is unique
and represents a unique feature (fi) which can be found
within the input phoneme string.
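
The extraction of overlapping fragments can be sketched in a few lines. The function name `mgrams` and the use of a `Counter` to hold the frequency of occurrence of each feature are choices made for this illustration, not details taken from the embodiment.

```python
from collections import Counter
from typing import Sequence


def mgrams(phonemes: Sequence[str], m: int = 3) -> Counter:
    """Count the overlapping fragments of m consecutive phonemes."""
    return Counter(
        tuple(phonemes[i:i + m]) for i in range(len(phonemes) - m + 1)
    )


# The six-phoneme string of Figure 8 yields four M-GRAMS of size three.
features = mgrams(["a", "b", "c", "d", "e", "f"])
for fragment, count in features.items():
    print(fragment, count)
```

A string shorter than m phonemes yields no fragments, so very short queries would need separate handling.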

Therefore, referring to Figure 7, the first step s51 in
performing the phoneme search in step s13 shown in Figure
6, is to identify all the different M-GRAMS which are in
the input phoneme data and their frequency of occurrence.
Then, in step s53, the search engine 53 determines the
frequency of occurrence of the identified M-GRAMS in the
selected portion of the database (identified from the
word search performed in step s5 in Figure 6). To
illustrate this, for a given portion of the database and
for the example M-GRAMS illustrated in Figure 8, this
yields the following table of information:



M-GRAM          Input phoneme string    Phoneme string of
(feature (fi))  frequency of            selected portion of
                occurrence (g)          database (a)

M1              1                       0
M2              2                       2
...

If the frequencies of occurrence of the M-GRAMS in the
input phoneme string and in the selected portion of the
database are considered as vectors g and a respectively
(the second and third columns of the above table), then
the angle between these vectors should be small when the
two phoneme strings are similar. This is illustrated in
Figure 9 for two-dimensional vectors. In the example
shown in Figure 8, the vectors a and g will be four
dimensional vectors and the similarity score can be
calculated from:

$$\text{score} = \cos\theta = \frac{a \cdot g}{|a|\,|g|}$$

This score is then associated with the current selected
portion of the database and stored until the end of the
search. In some applications, the vectors used in the
calculation of the cosine measure will be the logarithm
of these frequencies of occurrences, rather than the
frequencies of occurrences themselves.
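
A cosine measure over two frequency vectors can be sketched as follows. The function name, the treatment of all-zero vectors and the use of log(1 + f) for the logarithmic variant (so that a zero count stays at zero) are assumptions made for this illustration; the example vectors are hypothetical and are not the values of the table.

```python
import math
from typing import Sequence


def cosine_score(a: Sequence[float], g: Sequence[float],
                 use_log: bool = False) -> float:
    """Cosine of the angle between frequency vectors a and g."""
    if use_log:
        # log(1 + f) keeps zero counts at zero; this offset is an assumption
        a = [math.log1p(x) for x in a]
        g = [math.log1p(x) for x in g]
    dot = sum(x * y for x, y in zip(a, g))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in g))
    # An all-zero vector has no direction; report no similarity.
    return dot / norm if norm else 0.0


print(cosine_score([1, 2, 0, 1], [1, 2, 1, 1]))
print(cosine_score([1, 2, 0, 1], [1, 2, 1, 1], use_log=True))
```

Because the measure depends only on the angle between the vectors, a database portion that repeats the query's M-GRAMS in the same proportions scores highly regardless of its length.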

The processing then proceeds to step s57 where the search
engine 53 identifies whether or not there are any more
selected portions of phoneme strings from the database
29. If there are, then the processing returns to step
s53 where a similar procedure is followed to identify the
score for this portion of the database. If there are no
more selected portions, then the searching ends and the
processing returns to step s15 shown in Figure 6, where
the control unit considers the scores generated by the
search engine 53 and identifies whether or not there is
a match by, for example, comparing the calculated scores
with a predetermined threshold value.
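
The final comparison against a threshold might look like the sketch below. The portion identifiers, the scores and the threshold value are hypothetical, and returning the matches in descending order of score is a design choice made here rather than something stated in the text.

```python
from typing import Dict, List


def matching_portions(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the portions whose score reaches the threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [portion for portion, score in ranked if score >= threshold]


portion_scores = {"portion 1": 0.42, "portion 2": 0.91, "portion 3": 0.77}
print(matching_portions(portion_scores, 0.7))
```

An empty list corresponds to the case in which no match is identified in step s15.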

As those skilled in the art will appreciate, a similar
matching operation will be performed in step s23 shown
in Figure 6. However, since the entire database is being
searched, this search is carried out by searching each
of the blocks discussed above in turn.

ALTERNATIVE EMBODIMENTS

As those skilled in the art will appreciate, this type
of phonetic and word annotation of data files in a
database provides a convenient and powerful way to allow
a user to search the database by voice. In the
illustrated embodiment, a single audio data stream was
annotated and stored in the database for subsequent
retrieval by the user. As those skilled in the art will
appreciate, when the input data file corresponds to a
video data file, the audio data within the data file will
usually include audio data for different speakers.
Instead of generating a single stream of annotation data
for the audio data, separate phoneme and word lattice
annotation data can be generated for the audio data of
each speaker. This may be achieved by identifying, from
the pitch or from another distinguishing feature of the
speech signals, the audio data which corresponds to each
of the speakers and then by annotating the different
speaker's audio separately. This may also be achieved if
the audio data was recorded in stereo or if an array of
microphones were used in generating the audio data,
since it is then possible to process the audio data to
extract the data for each speaker.

Figure 10 illustrates the form of the annotation data in
such an embodiment, where a first speaker utters the
words "this so" and the second speaker replies "yes".

As illustrated, the annotation data for the different 
speakers' audio data are time synchronised, relative to 
each other, so that the annotation data is still time 
synchronised to the video and audio data within the data 
file. In such an embodiment, the header information in 
the data structure should preferably include a list of 
the different speakers within the annotation data and, 
for each speaker, data defining that speaker's language, 
accent, dialect and phonetic set, and each block should 
identify those speakers that are active in the block. 

In the above embodiments, a speech recognition system was
used to generate the annotation data for annotating a
data file in the database. As those skilled in the art
will appreciate, other techniques can be used to generate
this annotation data. For example, a human operator can
listen to the audio data and generate a phonetic and word
transcription to thereby manually generate the annotation
data.

In the above embodiment, the database 29 and the
automatic speech recognition unit were both located
within the user terminal 59. As those skilled in the art
will appreciate, this is not essential. Figure 11
illustrates an embodiment in which the database 29 and
the search engine 53 are located in a remote server 60
and in which the user terminal 59 accesses the database
29 via the network interface units 67 and 69 and a data
network 68. The user inputs a voice query via the
microphone 7 which is converted into phoneme and word
data by the automatic speech recognition unit 51. This
data is then passed to the control unit which controls
the transmission of this phoneme and word data over the
data network 68 to the search engine 53 located within
the remote server 60. The search engine 53 then carries
out the search in a similar manner to the way in which
the search was performed in the first embodiment. The
results of the search are then transmitted back from the
search engine 53 to the control unit 55 via the data
network 68. The control unit 55 then considers the
search results received back from the network and
displays appropriate data on the display 57 for viewing
by the user 39.

In addition to locating the database 29 and the search
engine 53 in the remote server 60, it is also possible
to locate the automatic speech recognition unit 51 in the
remote server 60. Such an embodiment is shown in Figure
12. As shown, in this embodiment, the input voice query
from the user is passed to an encoding unit 73 which is
operable to encode the speech
for efficient transfer through the data network 68. The
encoded data is then passed to the control unit 55 which
transmits the data over the network 68 to the remote
server 60, where it is processed by the automatic speech
recognition unit 51. The phoneme and word data generated
by the speech recognition unit 51 for the input query is
then passed to the search engine 53 for use in searching
the database 29. The search results generated by the
search engine 53 are then passed, via the network
interface 69 and the network 68, back to the user
terminal 59. The search results received back from the
remote server are passed via the network interface unit
67 to the control unit 55 which analyses the search
results and generates and displays appropriate data on
the display 57 for viewing by the user.

In the above embodiments, the user inputs his query by 
voice. Figure 13 shows an alternative embodiment in 
which the user inputs the query via the keyboard 3 . As 
shown, the text input via the keyboard 3 is passed to 
phonetic transcription unit 75 which is operable to 
generate a corresponding phoneme string from the input 
text. This phoneme string, together with the words input
via the keyboard 3, is then passed to the control unit
55 which initiates a search of the database 29 using the
search engine 53. The way in which this search is carried
out is the same as in the first embodiment and will not,
therefore, be described again. As with the other
embodiments discussed above, the phonetic transcription
unit 75, search engine 53 and/or the database 29 may all
be located in a remote server.
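
The keyboard query path described above can be illustrated with a minimal sketch. It assumes a small pronunciation dictionary (`PRONUNCIATIONS`, hypothetical, with illustrative phoneme symbols) and a function `transcribe_query` standing in for the phonetic transcription unit 75. Neither name, nor the dictionary contents, comes from the specification, and in this sketch words missing from the dictionary simply contribute no phonemes.

```python
# Hypothetical pronunciation dictionary; entries are illustrative only.
PRONUNCIATIONS = {
    "picture": ["p", "ih", "k", "ch", "er"],
    "of": ["ah", "v"],
    "the": ["dh", "ah"],
    "taj": ["t", "aa", "jh"],
    "mahal": ["m", "ax", "h", "aa", "l"],
}


def transcribe_query(text):
    """Return (words, phonemes) for a typed query.

    Stands in for the phonetic transcription unit 75: the typed words are
    kept, and a corresponding phoneme string is generated from them.
    """
    words = text.lower().split()
    phonemes = []
    for word in words:
        # Assumption of this sketch: unknown words contribute no phonemes.
        phonemes.extend(PRONUNCIATIONS.get(word, []))
    return words, phonemes


words, phonemes = transcribe_query("picture of the Taj Mahal")
```

Both return values would then be handed on together, mirroring the way the phoneme string and the typed words are passed to the control unit 55 to initiate the search.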

In the first embodiment, the audio data from the data
file 31 was passed through an automatic speech
recognition unit in order to generate the phoneme
annotation data. In some situations, a transcript of the
audio data will be present in the data file. Such an
embodiment is illustrated in Figure 14. In this
embodiment, the data file 81 represents a digital video
data file which includes script data 81-3 which defines
the lines for the various actors in the video film. As
shown, the script data 81-3 is passed through a text to
phoneme converter 83, which generates phoneme lattice
data 85 using a stored dictionary which translates words
into possible sequences of phonemes. This phoneme lattice
data 85 is then combined with the script data 81-3 to
generate the above described phoneme and word lattice
annotation data 81-4. This annotation data is then added
to the data file 81 to generate an augmented data file
81' which is then added to the database 29. As those
skilled in the art will appreciate, this embodiment
facilitates the generation of separate phoneme and word
lattice annotation data for the different speakers within
the video data file, since the script data usually
contains
indications of who is talking. The synchronisation of 
the phoneme and word lattice annotation data with the 
video and audio data can then be achieved by performing 
a forced time alignment of the script data with the audio 
data using an automatic speech recognition system (not 
shown).
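
The script-based annotation described above can be sketched as follows. This is only an illustration under stated assumptions: the `Link` and `Lattice` classes, the `build_lattice` function and the `DICTIONARY` contents are hypothetical names and data chosen for the example, not taken from the specification. Time stamps, weightings and the division into blocks are omitted, and node indices serve only as identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    start: int   # index of the node the link leaves
    end: int     # index of the node the link enters
    label: str   # phoneme symbol or word
    kind: str    # "phoneme" or "word"


@dataclass
class Lattice:
    node_count: int = 1                        # node 0 starts the lattice
    links: list = field(default_factory=list)

    def new_node(self):
        """Allocate a new node and return its index."""
        self.node_count += 1
        return self.node_count - 1


# Hypothetical dictionary: each word maps to its possible phoneme sequences.
DICTIONARY = {
    "tomato": [["t", "ah", "m", "ey", "t", "ow"],
               ["t", "ah", "m", "aa", "t", "ow"]],
    "now": [["n", "aw"]],
}


def build_lattice(script_words, dictionary):
    """Combine script words and their phoneme sequences into one lattice."""
    lattice = Lattice()
    word_start = 0
    for word in script_words:
        word_end = lattice.new_node()
        # One chain of phoneme links per alternative pronunciation.
        for pron in dictionary.get(word, []):
            prev = word_start
            for i, phoneme in enumerate(pron):
                is_last = i == len(pron) - 1
                nxt = word_end if is_last else lattice.new_node()
                lattice.links.append(Link(prev, nxt, phoneme, "phoneme"))
                prev = nxt
        # The word itself spans the same pair of nodes as its phonemes.
        lattice.links.append(Link(word_start, word_end, word, "word"))
        word_start = word_end
    return lattice


lattice = build_lattice(["tomato", "now"], DICTIONARY)
```

In this sketch the first node has three outgoing links, two phoneme links (one per pronunciation of "tomato") and one word link, which is the kind of node connected to several other nodes by both phoneme and word links that the lattice relies on. Per-speaker annotation could be approximated by calling `build_lattice` once for each speaker's lines.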

CLAIMS:

1. An apparatus for generating annotation data for use
in annotating a data file comprising audio data, the
apparatus comprising:

an automatic speech recognition system for 
generating phoneme data for the audio data in the data 
file; 

a word decoder for identifying possible words within 
the phoneme data generated by the automatic speech 
recognition system; and 

means for combining the phoneme data and the decoded 
words to generate annotation data defining a phoneme and 
word lattice for the audio data in the data file; 

wherein said combining means comprises:

(i) means for generating data defining a plurality 
of nodes within the lattice and a plurality of links 
connecting the nodes within the lattice; and 

(ii) means for generating data associating a 
plurality of phonemes of the phoneme data with a 
respective plurality of links and for associating at 
least one of the identified words with at least one of 
said links.

2. An apparatus for generating annotation data for use 
in annotating a data file comprising text data, the 
apparatus comprising:

a text to phoneme converter for generating phoneme
data for the text data in the data file; and 

means for combining the phoneme data and the words 
in the text data to generate annotation data defining a 
phoneme and word lattice for the text data in the data
file; 

wherein said combining means comprises:

(i) means for generating data defining a plurality 
of nodes within the lattice and a plurality of links
connecting the nodes within the lattice; and

(ii) means for generating data associating a 
plurality of phonemes of the phoneme data with a 
respective plurality of links and for associating at 
least one of the identified words with at least one of
said links.

3. An apparatus according to claim 1 or 2, wherein said
combining means is operable to generate said data
defining said phoneme and word lattice in blocks of said
nodes.

4. An apparatus according to any preceding claim, 
wherein said combining means is operable to generate data 
defining time stamp information for each of said nodes.

5. An apparatus according to claim 4, wherein said
combining means is arranged to generate said phoneme and
word lattice data in blocks of equal time duration.

6. An apparatus according to claim 3 or 5, wherein said
combining means is operable to generate data which
defines each block's location within a database.

7. An apparatus according to claim 4 or any claim
dependent thereon, wherein said data file includes a time
sequential signal, and wherein said combining means is
operable to generate time stamp data which is time
synchronised with said time sequential signal.

8. An apparatus according to claim 7, wherein said time
sequential signal is an audio and/or video signal. 

9. An apparatus according to claim 1 or any claim 
dependent thereon, wherein said audio data includes audio 
data which defines the speech of a plurality of speakers,
and wherein said combining means is operable to generate
data which defines separate phoneme and word lattice
annotation data for the speech of each speaker.

10. An apparatus according to claim 2 or any claim 
dependent thereon, wherein said text data defines the 
speech of a plurality of speakers, and wherein said
combining means is operable to generate data defining
separate phoneme and word lattice annotation data for the
speech of each speaker.

11. An apparatus according to claim 1 or any claim 
dependent thereon, wherein said speech recognition system
is operable to generate data defining a weighting for the
phonemes associated with said links. 

12. An apparatus according to claim 1 or any claim 
dependent thereon, wherein said word decoder is operable
to generate data defining a weighting for the words
associated with said links. 

13. An apparatus according to any preceding claim, 
wherein said means for defining a plurality of nodes and
a plurality of links is operable to define at least one
node which is connected to a plurality of other nodes by
a plurality of links.

14. An apparatus according to claim 13, wherein at least
one of said plurality of links connecting said node to
said plurality of other nodes is associated with a
phoneme and wherein at least one of said links connecting
said node to said plurality of other nodes is associated
with a word.

15. A method of generating annotation data for use in 
annotating a data file comprising audio data, the method 
comprising the steps of:

using an automatic speech recognition system to 
generate phoneme data for the audio data in the data 
file; 

using a word decoder to identify possible words 
within the phoneme data generated by the automatic speech
recognition system; and

combining the phoneme data and the decoded words to 
generate annotation data defining a phoneme and word 
lattice for the audio data in the data file; 

wherein said combining step comprises the steps of:
(i) generating data defining a plurality of nodes 
within the lattice and a plurality of links connecting 
the nodes within the lattice; and 

(ii) generating data associating a plurality of
phonemes of the phoneme data with a respective plurality
of links and for associating at least one of the
identified words with at least one of said links.

16. A method of generating annotation data for use in 
annotating a data file comprising text data, the method 
comprising the steps of: 

using a text to phoneme converter to generate 
phoneme data for the text data in the data file; and 

combining the phoneme data and the words in the text 
data to generate annotation data defining a phoneme and 
word lattice for the text data in the data file; 

wherein said combining step comprises the steps of:
(i) generating data defining a plurality of nodes 
within the lattice and a plurality of links connecting 
the nodes within the lattice; and 
(ii) generating data associating a plurality of
phonemes of the phoneme data with a respective plurality
of links and for associating at least one of the 
identified words with at least one of said links. 

17. A method according to claim 15 or 16, wherein said
combining step generates said data defining said phoneme
and word lattice in blocks of said nodes.

18. A method according to any of claims 15 to 17,
wherein said combining step generates data defining time
stamp information for each of said nodes.

19. A method according to claim 18, wherein said 
combining step generates said phoneme and word lattice
data in blocks of equal time duration.

20. A method according to claim 17 or 19, wherein said 
combining step generates data which defines each block's 
location within a database. 

21. A method according to claim 18 or any claim 
dependent thereon, wherein said data file includes a time 
sequential signal, and wherein said combining step
generates time stamp data which is time synchronised with 
said time sequential signal. 

22. A method according to claim 21, wherein said time 
sequential signal is an audio and/or video signal. 

23. A method according to claim 15 or any claim 
dependent thereon, wherein said audio data includes audio 
data which defines the speech of a plurality of speakers,
and wherein said combining step generates data which
defines separate phoneme and word lattice annotation data
for the speech of each speaker.

24. A method according to claim 16 or any claim
dependent thereon, wherein said text data defines the
speech of a plurality of speakers, and wherein said
combining step generates data defining separate phoneme
and word lattice annotation data for the speech of each
speaker.

25. A method according to claim 15 or any claim 
dependent thereon, wherein said speech recognition system 
generates data defining a weighting for the phonemes 
associated with said links. 

26. A method according to claim 15 or any claim 
dependent thereon, wherein said word decoder generates
data defining a weighting for the words associated with
said links.

27. A method according to any of claims 15 to 26, 
wherein said step of defining a plurality of nodes and 
a plurality of links defines at least one node which is 
connected to a plurality of other nodes by a plurality
of links.

28. A method according to claim 27, wherein at least one
of said plurality of links connecting said node to said 
plurality of other nodes is associated with a phoneme and 
wherein at least one of said links connecting said node 
to said plurality of other nodes is associated with a 
word.


ABSTRACT 

DATABASE ANNOTATION AND RETRIEVAL

A data structure is provided for annotating data files 
within a database. The annotation data comprises a
phoneme and word lattice which allows the quick and 
efficient searching of data files within the database, 
in response to a user's input query for desired 
information. The structure of the annotation data is 
such that it allows the input query to be made by voice 
and can be used for annotating various kinds of data 
files, such as audio data files, audio and visual data 
files, multimedia data files etc. 